ABSTRACT

A study reports a set of labelling criteria which 
have been developed to label prosodic events in clear, continuous 
speech, and proposes a scheme whereby this information can be 
transcribed in a machine readable format. A prosody in a syllabic 
domain which is synchronized with a phonemic segmentation was 
annotated. A procedural definition of syllables based on the grouping 
of phones is presented. The criteria for hand labelling the 
prominence of each syllable, tone-unit boundaries and the pitch 
movement associated with each accented syllable, are described. Work 
to automate this process is presented and experimental results 
evaluating its performance are included. (Three tables of data are 
included; contains 18 references.) (Author/RS) 
ABSTRACT

We report a set of labelling criteria which have been developed
to label prosodic events in clear, continuous speech, and propose
a scheme whereby this information can be transcribed in a
machine readable format. We have chosen to annotate prosody in a
syllabic domain which is synchronised with a phonemic
segmentation. A procedural definition of syllables based on the
grouping of phones is presented. The criteria for hand labelling
the prominence of each syllable, tone-unit boundaries and the pitch
movement associated with each accented syllable are described. Work
to automate this process is presented and experimental results
evaluating its performance are included.

I. INTRODUCTION
The need for a large corpus of prosodically labelled English
speech is motivated by the use of prosodic events in training
speech synthesisers, in automated foreign language pronunciation
teaching, and to aid parsers used in speech recognition to
disambiguate phonetically similar, but syntactically different
utterances.

Speech synthesis requires a mapping from prosodic events to
a set of acoustic parameters for their realisation. Parsers and
the analysis of language pronunciation, on the other hand,
require the reverse mapping to provide descriptors for the acoustic
correlates of prosody, and semantic and pragmatic knowledge to
be extracted from these correlates. The prosodic labelling of a
language corpus must therefore annotate both the linguistically
significant features in speech prosody and the inflections of the
acoustic parameters.

We aim to transcribe sentential stress (the prominence of
syllables in continuous speech) and the pitch movement associated
with any accented syllables for such systems. By initially hand
labelling these prosodic aspects, a set of acoustic features is
sought which will form a mapping for speech synthesis and, at the
same time, enable these prosodic events to be labelled automatically
given the acoustic features, for parsers and language pronunciation
description. The transcription system we propose is intended
to be an annotation scheme for linguistically significant prosodic
events in English. It is not designed to give a detailed description
of every possible inflection in an F0 contour. The set of symbols
(see table 1) is designed for use both by a hand transcriber of the
prosodic events and by some automated procedure.

The labelling scheme described has been used to transcribe,
by hand, prosodic events in a database of 453 utterances from
the English language ATR conference-registration dialogues with
focus¹. An acoustic analysis of these labels attempts to establish
a correlation between a set of features chosen to characterise
the acoustic parameters believed to manifest prosody, and the
perceived prosodic events that are transcribed.

¹ The ATR dialogues were spoken by a female bilingual speaker of
Japanese and American English.

Continuous speech is initially segmented into phone units and
labelled using an HMM-based automatic segmenter (evaluated in
[18]). The phones identified are grouped into syllables. Syllable
boundaries are thus synchronised with the phone boundaries. The
procedure employed to group phones into syllables is described in
section II. Each syllable is labelled by hand as unstressed, stressed
(but not accented), stressed and accented (but not nuclear), or as
the nuclear accented syllable of a tone-unit. Each syllable
immediately preceding a tone-unit boundary is also marked, in order to
specify the boundary location. The nuclear accented syllable of
a tone-unit is (according to the "British School" of intonational
phonology) the final accented syllable in that tone-unit [5]. This
definition of nuclear syllables and the criteria used to determine
syllable prominence are addressed in section III. Each accented
syllable is associated with an additional label that describes the
pitch contour movement which marked it. Thus, pitch contour
labelling is also synchronised with syllable boundaries. The
movement itself may occur before, during or subsequent to the
domain of the accented syllable. Pitch contour
labelling criteria are described in section IV. In section V a set
of acoustic features is proposed which we intuitively feel will
describe the acoustic correlates of sentential stress. These acoustic
features are used to form a tree-based statistical model for a
small corpus of hand labelled prosodic events. This methodology
is described in section VI. Its application reveals a low correlation
between the acoustic features and the events labelled, which poses
questions regarding the relationship between the theory and the
acoustics of sentential stress. These are discussed in section VII.

II. SYLLABIFICATION
The following procedural definition is used for syllabification.

i) Phones are grouped into syllables on a phonological rather
than phonetic basis. Consonantal phones (such as [m, n, l, r, s])
which may result in schwa deletion [10, pp.297-299] [6] and take
on the syllabic nucleus, are therefore syllabified as if the vowels
were present. Hence, shortest in rapid speech is syllabified as
/7 0 - t ? t/, and additional as /» - *d I - / 1^- J/. A glottal stop
that may occur before or instead of a word-final stop is treated
as an instance of the underlying stop phone, and any glottalised
onset to vowels is considered to be part of the vowel.

ii) Syllable boundaries are formed from the boundaries of
words considered in isolation. Although in continuous speech,
consonants at the end of one word can syllabify with the initial
vowel of the following word [13], such resyllabification is not
necessary in forming a domain in which to describe prosodic events.
Thus, for example, the syllabification of at all differs from that of
a tall even if /t/ is aspirated in both cases and they are
phonetically identical. This approach has been adopted because the
exact boundaries between syllable nuclei are not of critical
importance, although identifying the nuclear phone is. Similarly,
resyllabification is unnecessary across words that appear to blend
together due to vowel deletion, as may be the case in under a,
which is syllabified as /'a n - d f -

iii) The boundaries between syllables are also determined by
the presence of a morphological boundary. The boundary
between a free morpheme and an inflectional suffix (except -s) or
a class-II derivational affix is taken to be a syllable boundary.
Thus, hopeless is syllabified as /'h əʊ p - l ə s/ rather than
/'h əʊ - p l ə s/; and uninteresting is syllabified as
/ʌ n - 'ɪ n - t ə - r e s t - ɪ ŋ/ rather than
/ʌ - 'n ɪ n - t ə - r e - s t ɪ ŋ/.

iv) On the basis of English phonotactics, any cluster of phones
forming the onset or the coda of a syllable must also be a
permissible word-initial or word-final cluster. According to this rule,
extra may be syllabified as /'e k - s t r ə/, /'e k s - t r ə/, or
/'e k s t - r ə/.

v) The 'maximal onset (and minimal coda) principle' [16] [4,
pp.10-18] arbitrates between competing analyses. According to
the principle, as many consonantal phones as possible form a
syllable onset. Using this principle, extra would be syllabified as
/'e k - s t r ə/. However, in cases when alternative boundaries
are possible, stressed syllables tend to attract consonants more
than unstressed ones, particularly in the case of ambisyllabic
consonants such as [s, f] [8, pp.19-23]. When this final criterion
is applied, the syllabification adopted for the example becomes
/'e k s - t r ə/.
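
Rules iv) and v) can be illustrated with a minimal Python sketch. It is not the procedure used for the corpus: the function names, the tiny inventories of permissible onsets and codas, and the choice of [s] as the only ambisyllabic consonant are all assumptions made for the example of extra.

```python
def candidate_splits(cluster, legal_onsets, legal_codas):
    """Return every (coda, onset) split of an intervocalic consonant
    cluster whose coda is a permissible word-final cluster and whose
    onset is a permissible word-initial cluster (rule iv)."""
    cluster = tuple(cluster)
    splits = []
    for i in range(len(cluster) + 1):
        coda, onset = cluster[:i], cluster[i:]
        if coda in legal_codas and onset in legal_onsets:
            splits.append((coda, onset))
    return splits


def maximal_onset(splits):
    """Pick the split with the longest onset (first part of rule v)."""
    return max(splits, key=lambda split: len(split[1]))


def attract_to_stressed(split, splits, preceding_stressed=True,
                        ambisyllabic=("s",)):
    """If the preceding syllable is stressed, move one ambisyllabic
    consonant from the onset into its coda, provided the result is
    still a permissible split (second part of rule v)."""
    coda, onset = split
    if preceding_stressed and onset and onset[0] in ambisyllabic:
        shifted = (coda + onset[:1], onset[1:])
        if shifted in splits:
            return shifted
    return split


# Hypothetical, deliberately tiny phonotactic inventories.
LEGAL_ONSETS = {(), ("r",), ("t", "r"), ("s", "t", "r")}
LEGAL_CODAS = {(), ("k",), ("k", "s"), ("k", "s", "t")}

# Intervocalic cluster of "extra": k s t r
splits = candidate_splits(["k", "s", "t", "r"], LEGAL_ONSETS, LEGAL_CODAS)
by_onset = maximal_onset(splits)
final = attract_to_stressed(by_onset, splits)
```

With these inventories the three permissible splits are k|str, ks|tr and kst|r; the maximal onset principle selects k|str, and the attraction of [s] to the stressed first syllable yields ks|tr.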

III. SENTENTIAL STRESS LABELLING

The salience of each syllable within an utterance is labelled 
as one of {u, s, a, n} (see table 1) on the following basis. 

Sententially stressed syllables are those that are perceived as
salient due to a prominence of energy and/or duration and/or
pitch [7] [12, chap 4] within an utterance. The default (and
therefore intonationally unmarked) pitch movement in English
is a slight downwards trend in pitch [5, 11]. This movement does
not give any intonational prominence to the syllable within the
declination, even if that syllable is stressed on the grounds of
prominent duration and/or intensity. The same situation occurs
if a stressed but unaccented syllable is one in a series of gently
rising pitch movements. Where there is no pitch discontinuity, there
is no accent [5]. An accented syllable must also be a stressed
syllable and an accompanying pitch movement must occur during
the accented syllable or on a syllable before or subsequent to the
perceived accented syllable [9].

Each tone-unit of an utterance will have one peak of
prominence in the form of a nuclear pitch movement. The nuclear
accented syllable is the syllable on which the one obligatory pitch
movement occurs in a tone-unit. This is traditionally believed to
be the final accented syllable in a tone-unit [15]. At present, we
make use of this traditional definition.

Tone-unit boundaries are marked by placing a diacritic {:} on
the label {u, s, a, n} of the syllable immediately preceding the
boundary. The tone-unit boundaries are identified by two
phonetic features [5, pp.204-207]. Firstly, the presence of junctural
features, such as slight pauses, final lengthening and rhythmic
discontinuities, can signal the end of a tone-unit. However, a
pause does not necessarily correspond with a tone-unit boundary
in spontaneous speech, particularly in cases of disfluency.
Secondly, given that the first prominent syllable, for the majority
of tone-units in an utterance, is of approximately the same pitch
level [5], the boundary may be signalled by some perceivable pitch
change. This change can be either a step up from a falling pitch
movement, or a step down from a rising pitch movement. It may
be difficult to identify such pitch resets when the tone-unit onset
is low and the final accent of the previous tone-unit ends with a
pitch fall, or the onset is high and follows a tone-unit whose final
accent ends with a rise in pitch.

Table 1: Symbols for Sentential Stress and Pitch Movement Labelling

ASCII†           Symbol   Description
u                {u}      completely unstressed
s                {s}      stressed but unaccented
a                {a}      stressed and accented
n                {n}      nuclear accented
| (pipe)         {:}      syllable immediately preceding a
                          tone-unit boundary
\                {\}      pitch accent is a fall
/                {/}      pitch accent is a rise
V                {V}      accent is a fall-rise
^ (hat)          {A}      accent is a rise-fall
I                {-}      level tone
<                {←}      pitch movement is part of the
                          realisation of an accented syllable to
                          the left of this syllable
>                {→}      pitch movement is part of the
                          realisation of an accented syllable to
                          the right of this syllable
- (minus)        {"}      the range of the pitch movement is
                          unusually wide (increased)
_ (underscore)   {-}      the range of the pitch movement is
                          unusually narrow (decreased)
' (apostrophe)   {A}      pitch "peak" or level tone pitch is
                          unusually high
, (comma)        {v}      pitch "peak" or level tone pitch is
                          unusually low
[                {I }     initial part of {V} or {A} pitch
                          movement is shallow
]                { 1}     final part of {V} or {A} pitch
                          movement is shallow

† The ASCII characters listed are the prosodic labels used in machine
readable data.

IV. PITCH MOVEMENT LABELLING 
The pitch contour of an utterance is labelled as a series of
pitch movements at (or near) each accented syllable. A pitch
movement is either a continuous pitch glide, for example over a
long vocalic section of speech, or a discrete pitch jump from one
level to another over a series of syllables. Each pitch movement in
an utterance is labelled as one of the five categories {\, /, V, A, -}
(see table 1).

A description is associated with each and every syllable
labelled as accented (or nuclear accented) to mark the direction of
pitch movement on this and any following unaccented syllables.
These labels should only be time aligned with an unstressed or an
accented (nuclear or otherwise) syllable {u, a, n}, but not with
a stressed (but unaccented) syllable {s}. (Any stressed syllable
corresponding with a time aligned pitch movement label should
be marked as an accented syllable.) If the pitch movement is
aligned with an unstressed syllable {u}, a diacritic is applied to
the pitch movement label in order to indicate whether the pitch
movement is part of the realisation of the nearest accented
syllable {a, n} to the left or the nearest one to the right.
There may be more than one pitch movement associated with an
accented syllable; for example, if there is a rise-fall pitch
movement in the realisation of an accented syllable but the rise occurs
on a preceding unstressed syllable and the fall occurs on a
succeeding unstressed syllable. The use of these diacritics enables the
inflections of the F0 contour to be described while maintaining a
transcription of the perceived pitch movement.
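
The alignment constraints above lend themselves to a mechanical check. The following is a minimal sketch under stated assumptions: the `Syllable` record, the movement names and the "left"/"right" values are hypothetical stand-ins for the transcription symbols, and the check covers only the rules stated in this section.

```python
from dataclasses import dataclass
from typing import Optional

MOVEMENTS = {"fall", "rise", "fall-rise", "rise-fall", "level"}
ACCENTED = {"a", "n"}


@dataclass
class Syllable:
    stress: str                       # one of "u", "s", "a", "n"
    movement: Optional[str] = None    # pitch movement aligned here, if any
    belongs_to: Optional[str] = None  # "left" or "right", only used on "u"


def nearest_accented(syllables, i, direction):
    """Index of the nearest accented syllable left or right of i."""
    step = -1 if direction == "left" else 1
    j = i + step
    while 0 <= j < len(syllables):
        if syllables[j].stress in ACCENTED:
            return j
        j += step
    return None


def check_alignment(syllables):
    """Return (index, message) pairs for every violated rule."""
    problems = []
    realised = set()  # accented syllables that own at least one movement
    for i, syl in enumerate(syllables):
        if syl.movement is None:
            continue
        if syl.movement not in MOVEMENTS:
            problems.append((i, "unknown pitch movement"))
        elif syl.stress == "s":
            problems.append((i, "movement on {s}: relabel as accented"))
        elif syl.stress == "u":
            target = None
            if syl.belongs_to in ("left", "right"):
                target = nearest_accented(syllables, i, syl.belongs_to)
            if target is None:
                problems.append((i, "movement on {u} lacks a valid direction"))
            else:
                realised.add(target)
        else:
            realised.add(i)
    for i, syl in enumerate(syllables):
        if syl.stress in ACCENTED and i not in realised:
            problems.append((i, "accented syllable without a pitch movement"))
    return problems


# The rise-fall example from the text: rise on the preceding unstressed
# syllable, fall on the succeeding one, both owned by the nuclear syllable.
rise_fall = [Syllable("u", "rise", "right"), Syllable("n"),
             Syllable("u", "fall", "left")]
```

For the `rise_fall` example the check reports no problems, since both movements resolve to the nuclear syllable between them.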

Pitch range markings are used to describe the extent of the
movement in a pitch glide and the distance between levels of a
pitch jump, but not for level tone. If the pitch range is
distinctively wider or narrower than expected for a particular contrastive
effect, it is marked with a diacritic on the pitch direction
labels. Diacritics are also applied to these labels if the "peak"
part of a pitch movement (the initial part of a fall {\}, the final
part of a rise {/}, and the mid-section of a fall-rise or rise-fall
{V, A}) or the pitch of a level tone {-} is unusually high {A} or
low {v} for the particular speaker. In order to describe
occurrences of pitch fall-rise and rise-fall with a particularly shallow
rise or shallow fall, two further diacritics are included. These are
used to represent, for example, a fall followed by a shallow rise,
by adding the final-part diacritic to {V}.

V. ACOUSTIC FEATURES
A set of acoustic features must be extracted from the raw
speech waveform in order to automatically identify syllable
prominence and pitch movements. In our preliminary stages of
producing an automatic prosodic labelling algorithm, eighteen features
are used to describe what we believe to be the acoustic
correlates of stress (duration, intensity and fundamental frequency).

The energy and fundamental frequency of the speech
waveform (sampled at 20kHz) are measured for 20ms frames of speech
at 5ms intervals so that values are synchronised with the
cepstral coefficients and lower three formant frequencies used in the
auto-segmentation process. The fundamental frequency (F0) is
determined using a slightly enhanced version of the pitch tracker
described in [14]. In order to measure the signal energy, each
frame is passed through a Blackman-Harris window and an
amplitude spectrum is calculated using a 512-point FFT. The
total energy for the frequency range of 50Hz-2kHz is determined
by summation of the corresponding frequency bins. Each frame
energy value is then expressed in decibels with respect to the
maximum frame energy for the utterance. This process forms
an utterance-normalised sonorant energy contour. Both the raw
F0 contour and the energy contour are smoothed using a 3-point
non-linear median filter and a 5-point Hanning window [17].
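
The energy measurement and smoothing can be sketched in Python using only the standard library. Several details are assumptions rather than the settings used here: the 4-term Blackman-Harris coefficients, summing squared bin magnitudes as the band energy, the normalised 5-point Hanning weights, and replicating the end values when smoothing. A direct DFT over the band bins stands in for the 512-point FFT, and handling of unvoiced frames in an F0 contour is not covered.

```python
import cmath
import math
from statistics import median

SAMPLE_RATE = 20000       # Hz
FRAME_LEN = 400           # 20 ms at 20 kHz
FRAME_STEP = 100          # 5 ms at 20 kHz
FFT_SIZE = 512
BAND_HZ = (50.0, 2000.0)  # sonorant energy band


def blackman_harris(n):
    """Symmetric 4-term Blackman-Harris window of length n."""
    a = (0.35875, 0.48829, 0.14128, 0.01168)
    return [a[0]
            - a[1] * math.cos(2 * math.pi * k / (n - 1))
            + a[2] * math.cos(4 * math.pi * k / (n - 1))
            - a[3] * math.cos(6 * math.pi * k / (n - 1))
            for k in range(n)]


def band_energy(frame, window):
    """Sum of squared DFT magnitudes over the bins inside BAND_HZ.
    The frame is implicitly zero-padded to FFT_SIZE points."""
    lo = math.ceil(BAND_HZ[0] * FFT_SIZE / SAMPLE_RATE)
    hi = math.floor(BAND_HZ[1] * FFT_SIZE / SAMPLE_RATE)
    windowed = [s * w for s, w in zip(frame, window)]
    energy = 0.0
    for k in range(lo, hi + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * n / FFT_SIZE)
                    for n, x in enumerate(windowed))
        energy += abs(coeff) ** 2
    return energy


def energy_contour_db(signal):
    """Frame energies in dB relative to the utterance maximum."""
    window = blackman_harris(FRAME_LEN)
    starts = range(0, len(signal) - FRAME_LEN + 1, FRAME_STEP)
    energies = [band_energy(signal[s:s + FRAME_LEN], window) for s in starts]
    peak = max(energies)
    if peak <= 0.0:
        return [0.0] * len(energies)  # silent input: nothing to normalise
    floor = peak * 1e-12              # avoids log10(0) on silent frames
    return [10.0 * math.log10(max(e, floor) / peak) for e in energies]


def smooth(contour):
    """3-point median filter followed by a 5-point Hanning window."""
    def padded(values, half):
        return [values[0]] * half + list(values) + [values[-1]] * half

    p = padded(contour, 1)
    med = [median(p[i:i + 3]) for i in range(len(contour))]
    weights = [0.5 - 0.5 * math.cos(2 * math.pi * (k + 1) / 6)
               for k in range(5)]
    total = sum(weights)
    p = padded(med, 2)
    return [sum(w * v for w, v in zip(weights, p[i:i + 5])) / total
            for i in range(len(contour))]
```

The median stage removes isolated spikes (such as single-frame pitch tracking errors) before the Hanning stage smooths the remaining contour, which is the motivation usually given for this two-stage design.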

The phone given by auto-segmentation which forms the
nucleus of a given syllable is identified by the following procedure.
The phones in the syllable are split into two groups on the basis
of whether or not they are a member of the set of vocalic phones
and potentially syllabic consonantal phones (currently, all vowels
plus [l, m, n, r]). Each phone is associated with the maximum
sonorant energy within its tenure. If there are phones in the
syllable which are members of this set, then the one whose associated
energy is greatest is selected as the syllable nucleus. Otherwise,
none of the syllable phones are [vowel, l, m, n, r] and the phone
with the greatest maximum sonorant energy is selected. The
duration associated with any syllable in determining its prominence
is the duration of its nuclear phone - this will be referred to as
the "syllable duration". Using the duration of the entire syllable
or the duration of all consecutive sonorants in the syllable as this
measure has not yet been investigated.
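
The nucleus selection procedure can be written down directly. In this minimal sketch the phone labels, the vowel inventory and the tuple layout are hypothetical; only the selection logic follows the description above.

```python
SYLLABIC_CONSONANTS = {"l", "m", "n", "r"}


def select_nucleus(phones, vowels):
    """Pick the nuclear phone of one syllable.

    phones: list of (label, max_sonorant_energy_db, duration_s) tuples.
    vowels: set of labels treated as vocalic (assumed inventory).
    """
    candidates = [p for p in phones
                  if p[0] in vowels or p[0] in SYLLABIC_CONSONANTS]
    # Fall back to all phones when no vowel or syllabic consonant exists.
    pool = candidates if candidates else phones
    return max(pool, key=lambda p: p[1])


VOWELS = {"@", "e", "I"}  # hypothetical labels, for illustration only

# "tra": the schwa has the greatest sonorant energy among candidates.
label, energy, syllable_duration = select_nucleus(
    [("t", -30.0, 0.05), ("r", -12.0, 0.04), ("@", -3.0, 0.09)], VOWELS)

# No vowel or syllabic consonant present: the loudest phone is taken.
fallback = select_nucleus([("s", -20.0, 0.10), ("t", -35.0, 0.03)], VOWELS)
```

The third element of the selected tuple is the value the text calls the "syllable duration".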

Each syllable in an utterance is characterised by the maximum
sonorant energy within its tenure (syllable energy), its "syllable
duration", the maximum F0 value within its tenure, the F0
values at the beginning and at the end of the syllable, and an
F0 slope in Hz per second which describes the rate of change in
F0 through any voiced regions of the syllable. The syllable
energy and "syllable duration" are Z-score normalised to eliminate
phone-specific effects [2]. For each phone type, the mean and
population standard deviation of the syllable energy/duration is
determined. Then, for each token of that phone type, the syllable
energy/duration is normalised by subtracting the mean and
dividing by the population standard deviation. Hence, for each
syllable, there are six acoustic features extracted - phone-normalised
duration, phone-normalised energy, maximum F0, start-time F0,
stop-time F0, and F0 slope. In automatically establishing the
prominence of any syllable in an utterance, these six features for
the current, previous and next syllable are used, giving a total of
eighteen features per syllable.

Table 2: Confusion Matrix of Sentential Stress Labelling by Hand
and by Automation - cyclic exclusion of each utterance during
training

                            Automatic Label
Hand Label   a,n            s            u              total
a,n           889 (12.3%)    72 (1.0%)    849 (11.7%)   1810  (25.0%)
s             237  (3.3%)    72 (1.0%)    673  (9.3%)    982  (13.6%)
u             567  (7.8%)   142 (2.0%)   3731 (51.6%)   4440  (61.4%)
total        1693 (23.4%)   286 (4.0%)   5253 (72.6%)   7232 (100.0%)

Misclassification error rate = 2540/7232 (35.1%)

Table 3: Confusion Matrix of Sentential Stress Labelling by Hand
and by Automation - all utterances used during training

                            Automatic Label
Hand Label   a,n            s            u              total
a,n          1143 (15.8%)    44 (0.6%)    623  (8.6%)   1810  (25.0%)
s             240  (3.3%)   113 (1.6%)    629  (8.7%)    982  (13.6%)
u             334  (4.6%)    51 (0.7%)   4055 (56.1%)   4440  (61.4%)
total        1717 (23.7%)   208 (2.9%)   5307 (73.4%)   7232 (100.0%)

Misclassification error rate = 1921/7232 (26.6%)
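
The phone-specific normalisation and the three-syllable feature window can be sketched as follows. The function names and data layout are hypothetical, and the treatment of the first and last syllable of an utterance (padding with `None`) is an assumption, since the text does not say how utterance edges are handled.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_by_phone(tokens):
    """Z-score each value against the mean and population standard
    deviation of its own phone type.

    tokens: list of (phone_type, value) in corpus order.
    Returns the normalised values in the same order.
    """
    groups = defaultdict(list)
    for phone, value in tokens:
        groups[phone].append(value)
    stats = {p: (mean(v), pstdev(v)) for p, v in groups.items()}
    normalised = []
    for phone, value in tokens:
        mu, sigma = stats[phone]
        # A phone type with no spread carries no information: use 0.
        normalised.append((value - mu) / sigma if sigma > 0 else 0.0)
    return normalised


def context_features(per_syllable, pad=None):
    """Concatenate the six features of the current, previous and next
    syllable into one eighteen-element row per syllable."""
    width = 6
    missing = [pad] * width
    rows = []
    for i, current in enumerate(per_syllable):
        previous = per_syllable[i - 1] if i > 0 else missing
        following = (per_syllable[i + 1]
                     if i + 1 < len(per_syllable) else missing)
        rows.append(list(current) + list(previous) + list(following))
    return rows


durations = zscore_by_phone([("a", 1.0), ("a", 3.0), ("i", 5.0)])
rows = context_features([[1] * 6, [2] * 6])
```

Normalising within phone type removes, for example, the intrinsic tendency of open vowels to be longer and louder than close vowels, so that the remaining variation is more likely to reflect prominence.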

The F0 features are also normalised so that each movement
is independent of its absolute F0 values. Our intuition suggests
that F0 change is the significant factor, not the absolute F0
values. Normalisation of the nine F0 parameters (the maximum F0,
start-time F0, and stop-time F0 for the current, previous and
subsequent syllables), is performed by determining the minimum
value of these parameters and subtracting it from each. The
change in F0 through the syllables is therefore described
independently of the absolute height of the F0 movement.
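
As a small worked example of this normalisation, the sketch below subtracts the minimum of the nine F0 parameters from each of them. The function name and the F0 values are illustrative only, and syllables without any voiced frames are not handled.

```python
def normalise_f0(max_f0, start_f0, stop_f0):
    """Subtract the minimum of the nine F0 parameters from each.

    Each argument is a (previous, current, next) triple in Hz.
    """
    floor = min(list(max_f0) + list(start_f0) + list(stop_f0))

    def shift(triple):
        return tuple(value - floor for value in triple)

    return shift(max_f0), shift(start_f0), shift(stop_f0)


# Illustrative values in Hz; the minimum of the nine is 180.
norm_max, norm_start, norm_stop = normalise_f0(
    (210, 250, 200), (190, 220, 200), (205, 230, 180))
```

After the shift the lowest of the nine values is always zero, so two windows with the same F0 shape at different absolute heights produce identical features.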

VI. APPLICATION OF A TREE-BASED 
STATISTICAL MODEL 

The sentential stress and pitch movements associated with 
accented syllables have been hand labelled in the ATR database 
of 453 utterances using the symbols given in table 1. The prosodic 
transcription was done by only one labeller. 

The automatic prosodic labelling algorithm is still in its
infancy and so the acoustic features described in section V are being
used only to identify any given syllable in an utterance as either
unstressed, stressed or accented (nuclear or otherwise).
Distinguishing pitch movement types has not yet been incorporated.

The acoustic features are used as parameters to a tree-based
statistical model (using "S" [3]). The model is trained on all
but one of the utterances in the database. The tree classifies
each hand-transcribed sentential stress label on the basis of the
features given. This tree is then used to predict the labels for
the utterance that was not included in the training set. These
automatically generated labels are compared with those given by
hand. This process is repeated in a cyclic fashion for all the
utterances and the comparisons are summed. The confusion matrix
(table 2) indicates the number of occurrences that each
hand-transcribed label is predicted as accented {a, n}, stressed {s} or
unstressed {u} using this process.
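
The cyclic exclusion procedure can be outlined as a short Python loop. This is a sketch of the evaluation protocol only: the `train` and `predict` callables are placeholders for any classifier, and the majority-label classifier shown is a trivial stand-in, not the tree-based model built with "S".

```python
from collections import Counter


def leave_one_utterance_out(utterances, train, predict,
                            labels=("a,n", "s", "u")):
    """Hold out each utterance in turn, train on the rest, and sum the
    hand-label versus automatic-label comparisons.

    utterances: list of utterances, each a list of (features, hand_label).
    Returns {hand_label: Counter({automatic_label: count})}.
    """
    confusion = {hand: Counter() for hand in labels}
    for held_out in range(len(utterances)):
        training = [syllable
                    for i, utterance in enumerate(utterances)
                    if i != held_out
                    for syllable in utterance]
        model = train(training)
        for features, hand in utterances[held_out]:
            confusion[hand][predict(model, features)] += 1
    return confusion


def train_majority(training):
    """Stand-in 'model': the most frequent hand label in the training set."""
    return Counter(label for _, label in training).most_common(1)[0][0]


def predict_majority(model, features):
    """Always predict the majority label, ignoring the features."""
    return model


toy = [
    [((0.0,), "u"), ((1.0,), "a,n")],
    [((0.0,), "u")],
    [((0.0,), "u"), ((1.0,), "s")],
]
toy_confusion = leave_one_utterance_out(toy, train_majority, predict_majority)
```

Holding out whole utterances, rather than individual syllables, keeps the context features of a test syllable from overlapping with syllables seen in training.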

In order to give an indication of the dependency of the auto- 
matic labels on the method used, table 3 shows a similar confu- 
sion matrix generated when the test utterance is included in the 
training data. 
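
The summary figures quoted with the confusion matrices follow directly from the cell counts. The sketch below recomputes them for table 3; the dictionary layout and function names are assumptions, and the counts are as read from the table.

```python
# Rows are hand labels, columns are automatic labels (table 3).
TABLE_3 = {
    "a,n": {"a,n": 1143, "s": 44, "u": 623},
    "s": {"a,n": 240, "s": 113, "u": 629},
    "u": {"a,n": 334, "s": 51, "u": 4055},
}


def error_count(confusion):
    """Return (misclassified, total) for a confusion matrix."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(row[label] for label, row in confusion.items())
    return total - correct, total


def per_class_recall(confusion):
    """Fraction of each hand label that was predicted correctly."""
    return {label: row[label] / sum(row.values())
            for label, row in confusion.items()}


errors, total = error_count(TABLE_3)
error_percent = round(100.0 * errors / total, 1)
recall = per_class_recall(TABLE_3)
```

This gives 1921 errors out of 7232 syllables, matching the 26.6% reported under table 3. The per-class figures are not reported in the tables themselves, but they show how unevenly the errors are distributed across the three labels.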

VII. DISCUSSION
The misclassification error rate of 26.6% is quite promising
given that the selection of the acoustic features that have been
used is based on intuition. This, however, may not be the only
contributing factor to erroneous classifications. It could be that
the acoustic features are in fact closely related to the prosodic
events labelled, but that the tree-based statistical model is not
the most appropriate method to classify these events given the
acoustic features (this is supported by the considerable difference
between tables 2 and 3). Alternatively, the acoustic features
presented could be insufficient to characterise the prosodic events.
For example, it is likely that representing F0 movements across
a three-syllable window is restrictive, given that such movements
can clearly span many syllables or only part of one. It may be that
the labelling scheme is an inadequate system for describing sentential
stress and the pitch movements as perceived by the transcriber.
This can be illustrated by the fact that sentential stress is not a
simple binary distinction between stressed and unstressed. In
ambiguous cases, the transcriber uses linguistic knowledge not
evident in the acoustics. For example, the syllable in question will be
marked as sententially stressed only if it can be lexically stressed.
This may lead to every occurrence of schwa being marked as
unstressed regardless of the acoustic evidence. With such linguistic
knowledge unavailable to the tree-based model, confusions will
inevitably arise between the hand labels and automatic labels.

It is most likely that the classification errors are due to some
combination of all these factors, although the extent to which
any one factor affects the error rate is difficult to determine. The
correct-classification rate of 73.4% is, however, close to the
percentage of correlating labels between two hand labellers - in the
prosodic labelling of the Lancaster/IBM spoken English corpus,
transcribers achieved 72% agreement for seven categories of
sentential stress labels {\, /, V, A, -, s, u} and 83% agreement for
the categories "accented"/"stressed"/"unstressed" [1].